{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7765UFHoyGx6"
      },
      "source": [
        "##### Copyright 2019 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "KVtTDrUNyL7x"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xPYxZMrWyA0N"
      },
      "source": [
        "# Boosted trees using Estimators"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p_vOREjRx-Y0"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/tutorials/estimator/boosted_trees\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/estimator/boosted_trees.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/en/tutorials/estimator/boosted_trees.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\">View source on GitHub</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/estimator/boosted_trees.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6gWdn5lrlkhR"
      },
      "source": [
        "> Warning: Estimators are not recommended for new code.  Estimators run `v1.Session`-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our [compatibility guarantees] (https://tensorflow.org/guide/versions), but will receive no fixes other than security vulnerabilities. See the [migration guide](https://tensorflow.org/guide/migrate) for details."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qNW3c_rop5J8"
      },
      "source": [
        "**Note**: Modern Keras based implementations of many state of the art decision forest algorithms are available in [TensorFlow Decision Forests](https://tensorflow.org/decision_forests)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dW3r7qVxzqN5"
      },
      "source": [
        "This tutorial is an end-to-end walkthrough of training a Gradient Boosting  model using decision trees with the `tf.estimator` API. Boosted Trees models are among the most popular and effective machine learning approaches for both regression and classification. It is an ensemble technique that combines the predictions from several (think 10s, 100s or even 1000s) tree models.\n",
        "\n",
        "Boosted Trees models are popular with many machine learning practitioners as they can achieve impressive performance with minimal hyperparameter tuning."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eylrTPAN3rJV"
      },
      "source": [
        "## Load the titanic dataset\n",
        "You will be using the titanic dataset, where the (rather morbid) goal is to predict passenger survival, given characteristics such as gender, age, class, etc."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KuhAiPfZ3rJW"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "from IPython.display import clear_output\n",
        "from matplotlib import pyplot as plt\n",
        "\n",
        "# Load dataset.\n",
        "dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')\n",
        "dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')\n",
        "y_train = dftrain.pop('survived')\n",
        "y_eval = dfeval.pop('survived')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NFtnFm1T0kMf"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "tf.random.set_seed(123)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3ioodHdVJVdA"
      },
      "source": [
        "The dataset consists of a training set and an evaluation set:\n",
        "\n",
        "* `dftrain` and `y_train` are the *training set*—the data the model uses to learn.\n",
        "* The model is tested against the *eval set*, `dfeval`, and `y_eval`.\n",
        "\n",
        "For training you will use the following features:\n",
        "\n",
        "\n",
        "<table>\n",
        "  <tr>\n",
        "    <th>Feature Name</th>\n",
        "    <th>Description</th>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>sex</td>\n",
        "    <td>Gender of passenger</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>age</td>\n",
        "    <td>Age of passenger</td>\n",
        "  </tr>\n",
        "    <tr>\n",
        "    <td>n_siblings_spouses</td>\n",
        "    <td>siblings and partners aboard</td>\n",
        "  </tr>\n",
        "    <tr>\n",
        "    <td>parch</td>\n",
        "    <td>of parents and children aboard</td>\n",
        "  </tr>\n",
        "    <tr>\n",
        "    <td>fare</td>\n",
        "    <td>Fare passenger paid.</td>\n",
        "  </tr>\n",
        "    <tr>\n",
        "    <td>class</td>\n",
        "    <td>Passenger's class on ship</td>\n",
        "  </tr>\n",
        "    <tr>\n",
        "    <td>deck</td>\n",
        "    <td>Which deck passenger was on</td>\n",
        "  </tr>\n",
        "    <tr>\n",
        "    <td>embark_town</td>\n",
        "    <td>Which town passenger embarked from</td>\n",
        "  </tr>\n",
        "    <tr>\n",
        "    <td>alone</td>\n",
        "    <td>If passenger was alone</td>\n",
        "  </tr>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AoPiWsJALr-k"
      },
      "source": [
        "## Explore the data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "slcat1yzmzw5"
      },
      "source": [
        "Let's first preview some of the data and create summary statistics on the training set."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "15PLelXBlxEW"
      },
      "outputs": [],
      "source": [
        "dftrain.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "j2hiM4ETmqP0"
      },
      "outputs": [],
      "source": [
        "dftrain.describe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-IR0e8V-LyJ4"
      },
      "source": [
        "There are 627 and 264 examples in the training and evaluation sets, respectively."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_1NwYqGwDjFf"
      },
      "outputs": [],
      "source": [
        "dftrain.shape[0], dfeval.shape[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "28UFJ4KSMK3V"
      },
      "source": [
        "The majority of passengers are in their 20's and 30's."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CaVDmZtuDfux"
      },
      "outputs": [],
      "source": [
        "dftrain.age.hist(bins=20)\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1pifWiCoMbR5"
      },
      "source": [
        "There are approximately twice as male passengers as female passengers aboard."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-WazAq30MO5J"
      },
      "outputs": [],
      "source": [
        "dftrain.sex.value_counts().plot(kind='barh')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7_XkxrpmmVU_"
      },
      "source": [
        "The majority of passengers were in the \"third\" class."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zZ3PvVy4l4gI"
      },
      "outputs": [],
      "source": [
        "dftrain['class'].value_counts().plot(kind='barh')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HM5SlwlxmZMT"
      },
      "source": [
        "Most passengers embarked from Southampton."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RVTSrdr4mZaC"
      },
      "outputs": [],
      "source": [
        "dftrain['embark_town'].value_counts().plot(kind='barh')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aTn1niLPob3x"
      },
      "source": [
        "Females have a much higher chance of surviving vs. males. This will clearly be a predictive feature for the model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Eh3KW5oYkaNS"
      },
      "outputs": [],
      "source": [
        "pd.concat([dftrain, y_train], axis=1).groupby('sex').survived.mean().plot(kind='barh').set_xlabel('% survive')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "krkRHuMp3rJn"
      },
      "source": [
        "## Create feature columns and input functions\n",
        "The Gradient Boosting estimator can utilize both numeric and categorical features. Feature columns work with all TensorFlow estimators and their purpose is to define the features used for modeling. Additionally they provide some feature engineering capabilities like one-hot-encoding, normalization, and bucketization. In this tutorial, the fields in `CATEGORICAL_COLUMNS` are transformed from categorical columns to one-hot-encoded columns ([indicator column](https://www.tensorflow.org/api_docs/python/tf/feature_column/indicator_column)):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "upaNWxcF3rJn"
      },
      "outputs": [],
      "source": [
        "CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',\n",
        "                       'embark_town', 'alone']\n",
        "NUMERIC_COLUMNS = ['age', 'fare']\n",
        "\n",
        "def one_hot_cat_column(feature_name, vocab):\n",
        "  return tf.feature_column.indicator_column(\n",
        "      tf.feature_column.categorical_column_with_vocabulary_list(feature_name,\n",
        "                                                 vocab))\n",
        "feature_columns = []\n",
        "for feature_name in CATEGORICAL_COLUMNS:\n",
        "  # Need to one-hot encode categorical features.\n",
        "  vocabulary = dftrain[feature_name].unique()\n",
        "  feature_columns.append(one_hot_cat_column(feature_name, vocabulary))\n",
        "\n",
        "for feature_name in NUMERIC_COLUMNS:\n",
        "  feature_columns.append(tf.feature_column.numeric_column(feature_name,\n",
        "                                           dtype=tf.float32))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "74GNtFpStSAz"
      },
      "source": [
        "You can view the transformation that a feature column produces. For example, here is the output when using the `indicator_column` on a single example:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Eaq79D9FtmF8"
      },
      "outputs": [],
      "source": [
        "example = dict(dftrain.head(1))\n",
        "class_fc = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list('class', ('First', 'Second', 'Third')))\n",
        "print('Feature value: \"{}\"'.format(example['class'].iloc[0]))\n",
        "print('One-hot encoded: ', tf.keras.layers.DenseFeatures([class_fc])(example).numpy())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YbCUn3nCusC3"
      },
      "source": [
        "Additionally, you can view all of the feature column transformations together:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "omIYcsVws3g0"
      },
      "outputs": [],
      "source": [
        "tf.keras.layers.DenseFeatures(feature_columns)(example).numpy()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-UOlROp33rJo"
      },
      "source": [
        "Next you need to create the input functions. These will specify how data will be read into our model for both training and inference. You will use the `from_tensor_slices` method in the [`tf.data`](https://www.tensorflow.org/api_docs/python/tf/data) API to read in data directly from Pandas. This is suitable for smaller, in-memory datasets. For larger datasets, the tf.data API supports a variety of file formats (including [csv](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset)) so that you can process datasets that do not fit in memory."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9dquwCQB3rJp"
      },
      "outputs": [],
      "source": [
        "# Use entire batch since this is such a small dataset.\n",
        "NUM_EXAMPLES = len(y_train)\n",
        "\n",
        "def make_input_fn(X, y, n_epochs=None, shuffle=True):\n",
        "  def input_fn():\n",
        "    dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))\n",
        "    if shuffle:\n",
        "      dataset = dataset.shuffle(NUM_EXAMPLES)\n",
        "    # For training, cycle thru dataset as many times as need (n_epochs=None).\n",
        "    dataset = dataset.repeat(n_epochs)\n",
        "    # In memory training doesn't use batching.\n",
        "    dataset = dataset.batch(NUM_EXAMPLES)\n",
        "    return dataset\n",
        "  return input_fn\n",
        "\n",
        "# Training and evaluation input functions.\n",
        "train_input_fn = make_input_fn(dftrain, y_train)\n",
        "eval_input_fn = make_input_fn(dfeval, y_eval, shuffle=False, n_epochs=1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HttfNNlN3rJr"
      },
      "source": [
        "## Train and evaluate the model\n",
        "\n",
        "Below you will do the following steps:\n",
        "\n",
        "1. Initialize the model, specifying the features and hyperparameters.\n",
        "2. Feed the training data to the model using the `train_input_fn` and train the model using the `train` function.\n",
        "3. You will assess model performance using the evaluation set—in this example, the `dfeval` DataFrame. You will verify that the predictions match the labels from the `y_eval` array.\n",
        "\n",
        "Before training a Boosted Trees model, let's first train a linear classifier (logistic regression model). It is best practice to start with a simpler model to establish a benchmark."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JPOGpmmq3rJr"
      },
      "outputs": [],
      "source": [
        "linear_est = tf.estimator.LinearClassifier(feature_columns)\n",
        "\n",
        "# Train model.\n",
        "linear_est.train(train_input_fn, max_steps=100)\n",
        "\n",
        "# Evaluation.\n",
        "result = linear_est.evaluate(eval_input_fn)\n",
        "clear_output()\n",
        "print(pd.Series(result))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BarkNXwA3rJu"
      },
      "source": [
        "Next let's train a Boosted Trees model. For boosted trees, regression (`BoostedTreesRegressor`) and classification (`BoostedTreesClassifier`) are supported. Since the goal is to predict a class - survive or not survive, you will use the `BoostedTreesClassifier`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tgEzMtlw3rJu"
      },
      "outputs": [],
      "source": [
        "# Since data fits into memory, use entire dataset per layer. It will be faster.\n",
        "# Above one batch is defined as the entire dataset.\n",
        "n_batches = 1\n",
        "est = tf.estimator.BoostedTreesClassifier(feature_columns,\n",
        "                                          n_batches_per_layer=n_batches)\n",
        "\n",
        "# The model will stop training once the specified number of trees is built, not\n",
        "# based on the number of steps.\n",
        "est.train(train_input_fn, max_steps=100)\n",
        "\n",
        "# Eval.\n",
        "result = est.evaluate(eval_input_fn)\n",
        "clear_output()\n",
        "print(pd.Series(result))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hEflwznXvuMP"
      },
      "source": [
        "Now you can use the train model to make predictions on a passenger from the evaluation set. TensorFlow models are optimized to make predictions on a batch, or collection, of examples at once. Earlier,  the `eval_input_fn` is  defined using the entire evaluation set."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6zmIjTr73rJ4"
      },
      "outputs": [],
      "source": [
        "pred_dicts = list(est.predict(eval_input_fn))\n",
        "probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])\n",
        "\n",
        "probs.plot(kind='hist', bins=20, title='predicted probabilities')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mBUaNN1BzJHG"
      },
      "source": [
        "Finally you can also look at the receiver operating characteristic (ROC) of the results, which will give us a better idea of the tradeoff between the true positive rate and false positive rate."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NzxghvVz3rJ6"
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import roc_curve\n",
        "\n",
        "fpr, tpr, _ = roc_curve(y_eval, probs)\n",
        "plt.plot(fpr, tpr)\n",
        "plt.title('ROC curve')\n",
        "plt.xlabel('false positive rate')\n",
        "plt.ylabel('true positive rate')\n",
        "plt.xlim(0,)\n",
        "plt.ylim(0,)\n",
        "plt.show()"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "boosted_trees.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
